@SamuelDegelia-NOAA
Contributor

@SamuelDegelia-NOAA SamuelDegelia-NOAA commented Jan 28, 2026

Description

This PR modifies the gsibec workarounds to force zero on the outer analysis grid points in the linear variable change section. With this method, we no longer need to fill background values into analysis grid points that have missing values.

These modifications are needed to prevent NaNs in the cost function when running 3dvar on the na3km domain. Note that we are still seeing some NaNs, which can be avoided by limiting 3dvar to a single outer loop. More changes to the gsibec code will likely be needed to resolve this, but for now this PR allows us to run at least one outer loop and start getting results.

Huge thanks to @Masanori-NOAA for debugging this problem and finding at least a partial solution.

Issue(s) addressed

None

Dependencies (if applicable)

None

Checklist

  • I have performed a self-review of my own code.
  • I have run rrfs tests before creating the PR (if applicable).
  • Unit tests added/updated (if applicable).

@rrfsbot
Collaborator

rrfsbot commented Jan 28, 2026

FAILED on hera

started build_and_test on hera at UTC time: Wed Jan 28 02:27:11 UTC 2026
finished at UTC time: Wed Jan 28 02:58:43 UTC 2026

Test project /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527/build/rrfs-test
      Start  6: rrfs_fv3jedi_2024052700_getkf_observer
      Start 15: rrfs_mpasjedi_2024052700_getkf_observer
      Start  1: rrfs_fv3jedi_2024052700_3dvar
      Start  2: rrfs_fv3jedi_2024052700_3denvar
      Start  3: rrfs_fv3jedi_2024052700_3denvar_mgbf
      Start  4: rrfs_fv3jedi_2024052700_hybrid3denvar
      Start  5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf
      Start  8: rrfs_fv3jedi_2024052700_3dvar_conv_surface
 1/18 Test  #1: rrfs_fv3jedi_2024052700_3dvar .................   Passed   38.65 sec
      Start  9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair
 2/18 Test  #8: rrfs_fv3jedi_2024052700_3dvar_conv_surface ....***Failed   76.86 sec
      Start 10: rrfs_fv3jedi_2024052700_3dvar_remote
 3/18 Test  #6: rrfs_fv3jedi_2024052700_getkf_observer ........   Passed   98.96 sec
      Start  7: rrfs_fv3jedi_2024052700_getkf_solver
 4/18 Test #10: rrfs_fv3jedi_2024052700_3dvar_remote ..........***Failed   44.29 sec
      Start 11: rrfs_fv3jedi_2024052700_3dvar_satrad
 5/18 Test  #9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair ...***Failed  100.57 sec
      Start 12: rrfs_fv3jedi_2024052700_3denvar_refl
 6/18 Test  #2: rrfs_fv3jedi_2024052700_3denvar ...............   Passed  185.11 sec
      Start 13: rrfs_mpasjedi_2024052700_bumploc
 7/18 Test #11: rrfs_fv3jedi_2024052700_3dvar_satrad ..........***Failed   74.29 sec
      Start 14: rrfs_mpasjedi_2024052700_3denvar
 8/18 Test  #4: rrfs_fv3jedi_2024052700_hybrid3denvar .........***Failed  210.08 sec
      Start 17: rrfs_mpasjedi_2024052700_3dvar
 9/18 Test  #3: rrfs_fv3jedi_2024052700_3denvar_mgbf ..........   Passed  228.73 sec
      Start 18: rrfs_bufr2ioda_msonet
10/18 Test #18: rrfs_bufr2ioda_msonet .........................   Passed   26.71 sec
11/18 Test #17: rrfs_mpasjedi_2024052700_3dvar ................   Passed   57.35 sec
12/18 Test #15: rrfs_mpasjedi_2024052700_getkf_observer .......   Passed  268.66 sec
      Start 16: rrfs_mpasjedi_2024052700_getkf_solver
13/18 Test  #7: rrfs_fv3jedi_2024052700_getkf_solver ..........   Passed  192.44 sec
14/18 Test  #5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf ....***Failed  307.39 sec
15/18 Test #16: rrfs_mpasjedi_2024052700_getkf_solver .........   Passed  191.12 sec
16/18 Test #13: rrfs_mpasjedi_2024052700_bumploc ..............   Passed  310.41 sec
17/18 Test #14: rrfs_mpasjedi_2024052700_3denvar ..............   Passed  334.52 sec
18/18 Test #12: rrfs_fv3jedi_2024052700_3denvar_refl ..........   Passed  530.79 sec

67% tests passed, 6 tests failed out of 18

Label Time Summary:
mpi            = 3276.92 sec*proc (18 tests)
rdas-bundle    = 3276.92 sec*proc (18 tests)
script         = 3276.92 sec*proc (18 tests)

Total Test time (real) = 670.07 sec

The following tests FAILED:
	  4 - rrfs_fv3jedi_2024052700_hybrid3denvar (Failed)
	  5 - rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf (Failed)
	  8 - rrfs_fv3jedi_2024052700_3dvar_conv_surface (Failed)
	  9 - rrfs_fv3jedi_2024052700_3dvar_conv_upperair (Failed)
	 10 - rrfs_fv3jedi_2024052700_3dvar_remote (Failed)
	 11 - rrfs_fv3jedi_2024052700_3dvar_satrad (Failed)
Errors while running CTest
Output from these tests are in: /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527/build/rrfs-test/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

workdir: /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527

@SamuelDegelia-NOAA
Contributor Author

PASSED on wcoss2

started build_and_test on wcoss2 at UTC time: Wed Jan 28 02:22:36 UTC 2026
finished at UTC time: Wed Jan 28 03:18:52 UTC 2026

Test project /lfs/h2/emc/da/noscrub/samuel.degelia/rrfsbot/PRs_RDASApp/527/build/rrfs-test
      Start  6: rrfs_fv3jedi_2024052700_getkf_observer
      Start 15: rrfs_mpasjedi_2024052700_getkf_observer
      Start  1: rrfs_fv3jedi_2024052700_3dvar
      Start  2: rrfs_fv3jedi_2024052700_3denvar
      Start  3: rrfs_fv3jedi_2024052700_3denvar_mgbf
      Start  4: rrfs_fv3jedi_2024052700_hybrid3denvar
      Start  5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf
      Start  8: rrfs_fv3jedi_2024052700_3dvar_conv_surface
      Start  9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair
      Start 10: rrfs_fv3jedi_2024052700_3dvar_remote
 1/18 Test  #8: rrfs_fv3jedi_2024052700_3dvar_conv_surface ....   Passed  180.90 sec
      Start 11: rrfs_fv3jedi_2024052700_3dvar_satrad
 2/18 Test  #1: rrfs_fv3jedi_2024052700_3dvar .................   Passed  230.07 sec
      Start 12: rrfs_fv3jedi_2024052700_3denvar_refl
 3/18 Test  #9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair ...   Passed  230.05 sec
      Start 13: rrfs_mpasjedi_2024052700_bumploc
 4/18 Test #10: rrfs_fv3jedi_2024052700_3dvar_remote ..........   Passed  278.70 sec
      Start 14: rrfs_mpasjedi_2024052700_3denvar
 5/18 Test  #6: rrfs_fv3jedi_2024052700_getkf_observer ........   Passed  355.93 sec
      Start  7: rrfs_fv3jedi_2024052700_getkf_solver
 6/18 Test #11: rrfs_fv3jedi_2024052700_3dvar_satrad ..........   Passed  186.04 sec
      Start 17: rrfs_mpasjedi_2024052700_3dvar
 7/18 Test  #2: rrfs_fv3jedi_2024052700_3denvar ...............   Passed  461.93 sec
      Start 18: rrfs_bufr2ioda_msonet
 8/18 Test  #4: rrfs_fv3jedi_2024052700_hybrid3denvar .........   Passed  465.92 sec
 9/18 Test #17: rrfs_mpasjedi_2024052700_3dvar ................   Passed  116.11 sec
10/18 Test #18: rrfs_bufr2ioda_msonet .........................   Passed   33.96 sec
11/18 Test  #5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf ....   Passed  545.91 sec
12/18 Test  #3: rrfs_fv3jedi_2024052700_3denvar_mgbf ..........   Passed  555.91 sec
13/18 Test #13: rrfs_mpasjedi_2024052700_bumploc ..............   Passed  347.00 sec
14/18 Test  #7: rrfs_fv3jedi_2024052700_getkf_solver ..........   Passed  224.99 sec
15/18 Test #15: rrfs_mpasjedi_2024052700_getkf_observer .......   Passed  727.19 sec
      Start 16: rrfs_mpasjedi_2024052700_getkf_solver
16/18 Test #14: rrfs_mpasjedi_2024052700_3denvar ..............   Passed  489.27 sec
17/18 Test #12: rrfs_fv3jedi_2024052700_3denvar_refl ..........   Passed  718.87 sec
18/18 Test #16: rrfs_mpasjedi_2024052700_getkf_solver .........   Passed  354.58 sec

100% tests passed, 0 tests failed out of 18

Label Time Summary:
rdas-bundle    = 6503.32 sec*proc (18 tests)
script         = 6503.32 sec*proc (18 tests)

Total Test time (real) = 1081.98 sec

workdir: /lfs/h2/emc/da/noscrub/samuel.degelia/rrfsbot/PRs_RDASApp/527

@SamuelDegelia-NOAA SamuelDegelia-NOAA marked this pull request as draft January 28, 2026 03:22
@SamuelDegelia-NOAA
Contributor Author

Converting to draft. For some reason it looks like this change is actually causing NaNs in the cost function on Hera but resolving them on WCOSS2...

@ShunLiu-NOAA ShunLiu-NOAA left a comment

I realized some ctests did not pass, so I am reverting my approval.

@SamuelDegelia-NOAA
Contributor Author

SamuelDegelia-NOAA commented Jan 28, 2026

It looks like the Hera failures occur due to NaNs now appearing on the second outer loop. So these changes are resolving NaNs during the first outer loop for the na3km case, but sometimes cause NaNs during the second outer loop for both conus13km and na3km (machine dependent). Will probably need help from @TingLei-NOAA and @Masanori-NOAA to figure this one out.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Jan 28, 2026

Yes, the NaNs in the cost function could come from NaNs in the background fields, or from values produced by undefined behavior (when compiler optimization is turned on), on filter grid points outside the model domain. In regional gsibec this could occur earlier because of the lateral boundary points, even earlier than @Masanori-NOAA once suspected; one example is the ges_prsl calculation in guess_grids.f90 of gsibec. I am still trying to figure out a simple way to deal with this situation. I will focus on this issue, in collaboration with Masanori and colleagues, after I finish the current optimization of the MGBF code.

@SamuelDegelia-NOAA
Contributor Author

Thanks, @TingLei-NOAA!

@SamuelDegelia-NOAA
Contributor Author

I added some debug prints to track down the source of the NaNs in the minimizer. Tracing through various layers, I found that the NaNs originate in normal_rh_to_q.f90 within gsibec. At certain grid points, the derivative dqdrh is already non-finite because t and p are zero. This leads to NaNs when q is computed.

As a simple hardening step, I added additional checks in normal_rh_to_q (and the adjoint) to treat these points as invalid and skip them, rather than relying only on the existing ges_tsen < rmiss_th condition. Here is the general idea:

! needs: use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
real(r_kind), parameter :: pmin = 1.0e-6_r_kind  ! smallest pressure treated as valid
real(r_kind), parameter :: tmin = 1.0_r_kind     ! smallest temperature treated as valid

! old check: if (regional .and. ges_tsen(i,j,k,ntguessig) < rmiss_th) then
! new check: also skip points where t or p is non-finite or not above the floor
if (regional .and. ( &
    (ges_tsen(i,j,k,ntguessig) < rmiss_th) .or. &
    (.not. ieee_is_finite(t(i,j,k)))   .or. (t(i,j,k)   <= tmin) .or. &
    (.not. ieee_is_finite(p(i,j,k)))   .or. (p(i,j,k)   <= pmin) .or. &
    (.not. ieee_is_finite(p(i,j,k+1))) .or. (p(i,j,k+1) <= pmin) )) then
  q(i,j,k) = zero
  cycle
endif

After changing this if-block, 3dvar now runs all the way through on Hera. The minimization results are slightly different after this change, though (e.g., a different reduction of the residual norm). I am going to make some plots to see how similar the analyses are and whether this fix is okay.
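For anyone who wants to try the guard outside of gsibec, here is a minimal standalone sketch. It is not the gsibec code: the program name, the helper `usable`, the sample values, and the stand-in formula for q are all hypothetical, and r_kind is simply mapped to real64. It only shows that points with zero or non-finite t or p are set to zero and skipped, so no non-finite value reaches q.

```fortran
program guard_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_is_finite, ieee_value, ieee_quiet_nan
  use, intrinsic :: iso_fortran_env, only: r_kind => real64
  implicit none

  real(r_kind), parameter :: zero = 0.0_r_kind
  real(r_kind), parameter :: pmin = 1.0e-6_r_kind  ! same floor as in the snippet above
  real(r_kind), parameter :: tmin = 1.0_r_kind     ! same floor as in the snippet above
  integer, parameter :: n = 4

  real(r_kind) :: t(n), p(n), rh(n), q(n)
  integer :: i

  ! Hypothetical column: point 2 has zero t and p, point 3 has a NaN temperature.
  t  = [280.0_r_kind, zero, ieee_value(zero, ieee_quiet_nan), 270.0_r_kind]
  p  = [90.0_r_kind, zero, 80.0_r_kind, 70.0_r_kind]
  rh = 0.5_r_kind

  do i = 1, n
    ! Treat the point as invalid if t or p is non-finite or not above its floor.
    if (.not. usable(t(i), tmin) .or. .not. usable(p(i), pmin)) then
      q(i) = zero
      cycle
    end if
    ! Stand-in formula (assumption): any expression that divides by p would do.
    q(i) = rh(i) * t(i) / p(i)
  end do

  ! Self-check: every entry of q must be finite after the guard.
  if (.not. all(ieee_is_finite(q))) error stop 'non-finite value in q'
  print '(4es12.4)', q

contains

  logical function usable(x, xmin)
    real(r_kind), intent(in) :: x, xmin
    ! A NaN fails both tests, so it is never reported as usable.
    usable = ieee_is_finite(x) .and. (x > xmin)
  end function usable

end program guard_demo
```

Zeroing q at masked points changes what the minimizer sees, so this is a hardening step rather than a verified fix; the analyses still need to be compared.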

@SamuelDegelia-NOAA
Contributor Author

3dvar runs through after the above changes, but the analyses are very different. I am going to continue debugging.

@rrfsbot
Collaborator

rrfsbot commented Jan 29, 2026

PASSED on hera

started build_and_test on hera at UTC time: Thu Jan 29 18:42:15 UTC 2026
finished at UTC time: Thu Jan 29 19:11:12 UTC 2026

Test project /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527/build/rrfs-test
      Start  6: rrfs_fv3jedi_2024052700_getkf_observer
      Start 15: rrfs_mpasjedi_2024052700_getkf_observer
      Start  1: rrfs_fv3jedi_2024052700_3dvar
      Start  2: rrfs_fv3jedi_2024052700_3denvar
      Start  3: rrfs_fv3jedi_2024052700_3denvar_mgbf
      Start  4: rrfs_fv3jedi_2024052700_hybrid3denvar
      Start  5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf
      Start  8: rrfs_fv3jedi_2024052700_3dvar_conv_surface
 1/18 Test  #1: rrfs_fv3jedi_2024052700_3dvar .................   Passed   59.81 sec
      Start  9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair
 2/18 Test  #8: rrfs_fv3jedi_2024052700_3dvar_conv_surface ....   Passed   77.56 sec
      Start 10: rrfs_fv3jedi_2024052700_3dvar_remote
 3/18 Test  #6: rrfs_fv3jedi_2024052700_getkf_observer ........   Passed   82.63 sec
      Start  7: rrfs_fv3jedi_2024052700_getkf_solver
 4/18 Test #10: rrfs_fv3jedi_2024052700_3dvar_remote ..........   Passed   20.78 sec
      Start 11: rrfs_fv3jedi_2024052700_3dvar_satrad
 5/18 Test  #9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair ...   Passed   47.07 sec
      Start 12: rrfs_fv3jedi_2024052700_3denvar_refl
 6/18 Test #11: rrfs_fv3jedi_2024052700_3dvar_satrad ..........   Passed   61.65 sec
      Start 13: rrfs_mpasjedi_2024052700_bumploc
 7/18 Test  #5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf ....   Passed  164.58 sec
      Start 14: rrfs_mpasjedi_2024052700_3denvar
 8/18 Test  #3: rrfs_fv3jedi_2024052700_3denvar_mgbf ..........   Passed  174.60 sec
      Start 17: rrfs_mpasjedi_2024052700_3dvar
 9/18 Test  #4: rrfs_fv3jedi_2024052700_hybrid3denvar .........   Passed  192.15 sec
      Start 18: rrfs_bufr2ioda_msonet
10/18 Test #18: rrfs_bufr2ioda_msonet .........................   Passed   25.44 sec
11/18 Test  #2: rrfs_fv3jedi_2024052700_3denvar ...............   Passed  221.28 sec
12/18 Test #17: rrfs_mpasjedi_2024052700_3dvar ................   Passed   49.51 sec
13/18 Test  #7: rrfs_fv3jedi_2024052700_getkf_solver ..........   Passed  144.26 sec
14/18 Test #15: rrfs_mpasjedi_2024052700_getkf_observer .......   Passed  282.36 sec
      Start 16: rrfs_mpasjedi_2024052700_getkf_solver
15/18 Test #14: rrfs_mpasjedi_2024052700_3denvar ..............   Passed  271.67 sec
16/18 Test #16: rrfs_mpasjedi_2024052700_getkf_solver .........   Passed  169.80 sec
17/18 Test #12: rrfs_fv3jedi_2024052700_3denvar_refl ..........   Passed  388.02 sec
18/18 Test #13: rrfs_mpasjedi_2024052700_bumploc ..............   Passed  342.21 sec

100% tests passed, 0 tests failed out of 18

Label Time Summary:
mpi            = 2775.39 sec*proc (18 tests)
rdas-bundle    = 2775.39 sec*proc (18 tests)
script         = 2775.39 sec*proc (18 tests)

Total Test time (real) = 502.24 sec

workdir: /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527

@SamuelDegelia-NOAA
Contributor Author

PASSED on wcoss2

started build_and_test on wcoss2 at UTC time: Thu Jan 29 18:38:26 UTC 2026
finished at UTC time: Thu Jan 29 19:31:26 UTC 2026

Test project /lfs/h2/emc/da/noscrub/samuel.degelia/rrfsbot/PRs_RDASApp/527/build/rrfs-test
      Start  6: rrfs_fv3jedi_2024052700_getkf_observer
      Start 15: rrfs_mpasjedi_2024052700_getkf_observer
      Start  1: rrfs_fv3jedi_2024052700_3dvar
      Start  2: rrfs_fv3jedi_2024052700_3denvar
      Start  3: rrfs_fv3jedi_2024052700_3denvar_mgbf
      Start  4: rrfs_fv3jedi_2024052700_hybrid3denvar
      Start  5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf
      Start  8: rrfs_fv3jedi_2024052700_3dvar_conv_surface
      Start  9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair
      Start 10: rrfs_fv3jedi_2024052700_3dvar_remote
 1/18 Test #10: rrfs_fv3jedi_2024052700_3dvar_remote ..........   Passed   79.02 sec
      Start 11: rrfs_fv3jedi_2024052700_3dvar_satrad
 2/18 Test  #1: rrfs_fv3jedi_2024052700_3dvar .................   Passed   88.95 sec
      Start 12: rrfs_fv3jedi_2024052700_3denvar_refl
 3/18 Test  #9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair ...   Passed  100.86 sec
      Start 13: rrfs_mpasjedi_2024052700_bumploc
 4/18 Test  #8: rrfs_fv3jedi_2024052700_3dvar_conv_surface ....   Passed  103.87 sec
      Start 14: rrfs_mpasjedi_2024052700_3denvar
 5/18 Test  #6: rrfs_fv3jedi_2024052700_getkf_observer ........   Passed  134.89 sec
      Start  7: rrfs_fv3jedi_2024052700_getkf_solver
 6/18 Test #11: rrfs_fv3jedi_2024052700_3dvar_satrad ..........   Passed  133.83 sec
      Start 17: rrfs_mpasjedi_2024052700_3dvar
 7/18 Test  #2: rrfs_fv3jedi_2024052700_3denvar ...............   Passed  249.86 sec
      Start 18: rrfs_bufr2ioda_msonet
 8/18 Test  #4: rrfs_fv3jedi_2024052700_hybrid3denvar .........   Passed  251.85 sec
 9/18 Test #18: rrfs_bufr2ioda_msonet .........................   Passed   35.75 sec
10/18 Test  #3: rrfs_fv3jedi_2024052700_3denvar_mgbf ..........   Passed  295.86 sec
11/18 Test  #5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf ....   Passed  307.85 sec
12/18 Test  #7: rrfs_fv3jedi_2024052700_getkf_solver ..........   Passed  178.00 sec
13/18 Test #17: rrfs_mpasjedi_2024052700_3dvar ................   Passed  118.01 sec
14/18 Test #13: rrfs_mpasjedi_2024052700_bumploc ..............   Passed  361.99 sec
15/18 Test #15: rrfs_mpasjedi_2024052700_getkf_observer .......   Passed  464.88 sec
      Start 16: rrfs_mpasjedi_2024052700_getkf_solver
16/18 Test #14: rrfs_mpasjedi_2024052700_3denvar ..............   Passed  485.04 sec
17/18 Test #12: rrfs_fv3jedi_2024052700_3denvar_refl ..........   Passed  660.90 sec
18/18 Test #16: rrfs_mpasjedi_2024052700_getkf_solver .........   Passed  325.96 sec

100% tests passed, 0 tests failed out of 18

Label Time Summary:
rdas-bundle    = 4377.36 sec*proc (18 tests)
script         = 4377.36 sec*proc (18 tests)

Total Test time (real) = 790.92 sec

workdir: /lfs/h2/emc/da/noscrub/samuel.degelia/rrfsbot/PRs_RDASApp/527

@SamuelDegelia-NOAA
Contributor Author

I reverted the code changes in this PR to guess_grids.f90. That means we keep the code that checks for missing values and fills them in with the first valid value in the subdomain. After that change, the ctests pass on Hera and WCOSS2, and we are able to run 3dvar on na3km for the first outer loop.
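As a rough illustration of that fill step, here is a minimal standalone sketch. It is not the code in guess_grids.f90: the array, the threshold value, the convention that missing values are large negative numbers, and the column-major search order are all assumptions made for the example.

```fortran
program fill_missing_demo
  use, intrinsic :: iso_fortran_env, only: r_kind => real64
  implicit none

  ! Hypothetical threshold: anything below it counts as missing (assumption).
  real(r_kind), parameter :: rmiss_th = -1.0e10_r_kind

  real(r_kind) :: field(3,3)   ! stand-in for one level of a subdomain field
  real(r_kind) :: fill
  logical :: found
  integer :: i, j

  ! Sample data with two missing points.
  field      = 285.0_r_kind
  field(2,1) = 281.0_r_kind
  field(1,1) = -9.99e20_r_kind
  field(3,2) = -9.99e20_r_kind

  ! Find the first valid value, scanning in array-element (column-major) order.
  fill  = 0.0_r_kind
  found = .false.
  search: do j = 1, size(field, 2)
    do i = 1, size(field, 1)
      if (field(i,j) >= rmiss_th) then
        fill  = field(i,j)
        found = .true.
        exit search
      end if
    end do
  end do search

  ! Replace every missing point with that value, if one was found.
  if (found) then
    where (field < rmiss_th) field = fill
  end if

  ! Self-check: no missing values should remain.
  if (any(field < rmiss_th)) error stop 'missing values remain'
  print '(3f8.1)', field
end program fill_missing_demo
```

A fill like this keeps later calculations away from the missing-value sentinel, but the filled points carry no real information.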

The NaNs still show up starting on the second outer loop for na3km. But I think this PR is okay for now just to get things running while @TingLei-NOAA and @Masanori-NOAA work on a generic solution to resolve the NaNs.

@SamuelDegelia-NOAA SamuelDegelia-NOAA marked this pull request as ready for review January 29, 2026 21:02
ShunLiu-NOAA
ShunLiu-NOAA previously approved these changes Jan 29, 2026
@SamuelDegelia-NOAA
Contributor Author

Sorry @ShunLiu-NOAA, I just had one more quick commit to add a comment to build.sh about opening a PR for one of the workarounds. That dismissed your review.

@ShunLiu-NOAA ShunLiu-NOAA self-requested a review January 29, 2026 21:27
@ShunLiu-NOAA

No problem. Thanks for working on this.

@ShunLiu-NOAA ShunLiu-NOAA merged commit 3365fa9 into NOAA-EMC:develop Jan 30, 2026
1 check passed
@SamuelDegelia-NOAA SamuelDegelia-NOAA deleted the feature/update_gsibec_lvc branch January 30, 2026 19:47